PittPatt Face Detection and Tracking for the CLEAR 2007 Evaluation
Abstract
This paper describes Pittsburgh Pattern Recognition's participation in the face detection and tracking tasks for the CLEAR 2007 evaluation. Since CLEAR 2006, we have made substantial progress in optimizing our algorithms for speed, achieving better-than-real-time processing performance for a speed-up of more than 500× over the past two years. At the same time, we have maintained the high level of accuracy of our algorithm. In this paper, we first give a system overview, briefly explaining the three main stages of processing: (1) frame-based face detection; (2) motion-based tracking; and (3) track filtering. Second, we report our results, both in terms of accuracy and speed, over the two test data sets: (1) the CHIL Seminar corpus, and (2) the VACE Multisite Meeting corpus. Finally, we offer some analysis of both speed and accuracy performance, and of how these compare with our performance in CLEAR 2006.

1 System Description

Similar to our work for the CLEAR 2006 evaluation [1], our processing for CLEAR 2007 proceeds in three stages: (1) frame-based face detection; (2) motion-based tracking; and (3) track filtering. However, over the past couple of years, we have introduced a large number of changes and additions targeted at accelerating speed performance while maintaining system accuracy. Below, we give a full description of each processing stage, focusing particularly on system modifications implemented since CLEAR 2006.

1.1 Frame-based Face Detection

Face finding: At the heart of our system lies PittPatt's robust face finder, available for single-image testing through our web demo at http://demo.pittpatt.com. For the evaluation, two parameter settings differ from the default settings on the web demo. First, we configured the face finder to search for faces with an interocular distance as small as four pixels, approximately 50% smaller than for the web demo. Second, we set our normalized log-likelihood (ll) threshold to -0.75 (instead of 0.0); while this lower setting generates more false alarms, it also permits more correct detections, and later processing across frames (described in Secs. 1.2 and 1.3) eliminates most of the introduced false alarms while preserving more correctly detected faces.

Conceptually, the current version of the detection algorithm builds on the approach described in [2][3]; however, we have implemented large-scale improvements, both at the algorithm and code level, to dramatically boost speed performance. First, the detector has been radically re-designed to speed up performance algorithmically through:

1. Sharing of common computation across multiple detectors (e.g. frontal, profile, tilted);
2. Reduced stage-1 complexity within the detector (earlier stages typically consume more CPU cycles, since they must process all and/or larger portions of the image position-scale space);
3. Replacement of vector quantization with sparse coding to speed up probability-table look-ups;
4. Reduced size of probability models (i.e. histograms) to minimize memory bottlenecks; and
5. Improved termination of the classifier search in late stages of the detector.

Second, we re-engineered substantial portions of the code to minimize computational bottlenecks. This effort has paid off most significantly for expensive inner-loop computations (e.g. the wavelet transform), which have been largely re-written directly in assembly. When practical, these new code segments parallelize vector computations through Intel's SSE instruction-set extensions and minimize cache misses through improved memory-access patterns.

Third, we have implemented two distinct parallelization schemes that allow the detector to exploit the resources of multi-core and/or multi-CPU platforms. For real-time systems, where response time is critical, parallelization occurs at the video-frame level: processing for each frame is distributed across all available processors, and video frames are processed in order. The main disadvantage of this approach is that it introduces communication overhead between processors and does not trivially scale to large core/CPU counts. We therefore implemented a second, buffered parallelization scheme for stored media that allocates one video frame per processor. While this can result in short-term, out-of-order processing of video frames, and consequently requires buffering of the input video, it vastly reduces communication overhead and is therefore highly scalable to a large number of processors. The speed results reported in this paper correspond to this second, buffered implementation; a sketch of the frame-per-processor scheme appears below.
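To make the buffered scheme concrete, the following sketch (our illustration, not PittPatt's production code; detect_faces is a hypothetical stand-in for the single-frame face finder) assigns one frame per worker process and re-orders results back into frame order as workers complete:

```python
# Sketch of a buffered, frame-per-worker parallelization scheme.
# Assumptions (ours): detect_faces() stands in for the real detector;
# frames arrive as an iterable of images from a decoded video.
import heapq
from multiprocessing import Pool

def detect_faces(indexed_frame):
    """Hypothetical single-frame detector; returns (index, detections)."""
    idx, frame = indexed_frame
    detections = []  # ... run the real face finder on `frame` here ...
    return idx, detections

def detect_stream(frames, workers=8):
    """Yield (frame_index, detections) in frame order, even though
    workers may finish frames out of order."""
    next_idx = 0
    pending = []  # min-heap of finished results, keyed on frame index
    with Pool(workers) as pool:
        # imap_unordered maximizes throughput: each worker owns a whole
        # frame, so no cross-worker synchronization is needed per frame.
        for idx, dets in pool.imap_unordered(detect_faces, enumerate(frames)):
            heapq.heappush(pending, (idx, dets))
            # Emit buffered results as soon as they are next in order.
            while pending and pending[0][0] == next_idx:
                yield heapq.heappop(pending)
                next_idx += 1
```

Because each worker owns an entire frame, inter-process communication is limited to shipping inputs and results, which is what makes this style of scheme scale to large core counts.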
For each detected face, we retain the following meta-data: (1) face-center location (x, y); (2) face size s; (3) one of five possible pose categories, namely frontal, right/left profile, and ±30° tilted; and (4) classifier confidence c, c ≥ 0.25, where c is related to the detection log-likelihood ll by c = ll + 1.

Selective visual attention: In previous evaluations (e.g. VACE 2005, CLEAR 2006), the face finder processed every video frame at all positions and scales, independent of previous-frame detection results or of the level of change between consecutive video frames. This approach is quite wasteful computationally, since video contains an enormous amount of visually redundant information across time. For the CLEAR 2007 evaluation, we therefore implemented selective visual attention, an algorithm that focuses the detector's computational effort on the most necessary (i.e. changed) parts of each frame.

In broad terms, the algorithm proceeds as follows. First, we periodically process a full frame, independent of the properties of the input data. These key frames provide a reference point for subsequent frames and prevent detection errors from propagating beyond limited time spans. For this evaluation, we set the key-frame rate to 15. Then, for each intermediate frame, we determine those regions that have changed sufficiently since the last key frame to require examination by the face finder. We apply the face finder only to these regions of change, and then merge the resulting partial face-finder output with the full-frame results from the last processed key frame. Fig. 1 illustrates our selective attention algorithm for two sample video frames. While this example yields only one region of change [Fig. 1(a)-(e)], the selective attention algorithm frequently generates multiple, spatially disjoint regions, which our system trivially accommodates. When merging partial results with key-frame results [Fig. 1(f)-(h)], we arbitrate the combined output to eliminate possible duplicate detections of the same face. A simplified sketch of this key-frame logic follows.
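As a rough illustration only (our own simplification, not the evaluated implementation: plain frame differencing stands in for the real change test, frames are assumed grayscale, and detect() is any single-image face finder returning (x, y, s, c) tuples):

```python
# Simplified sketch of selective visual attention with key frames.
import numpy as np

KEY_FRAME_RATE = 15      # full-frame detection every 15 frames (per the paper)
CHANGE_THRESHOLD = 20    # per-pixel difference threshold (our guess)

def changed_regions(frame, key_frame):
    """Return bounding boxes (x0, y0, x1, y1) of regions that have
    changed sufficiently since the last key frame."""
    diff = np.abs(frame.astype(np.int16) - key_frame.astype(np.int16))
    ys, xs = np.nonzero(diff > CHANGE_THRESHOLD)
    if xs.size == 0:
        return []
    # One coarse box around all change; the real algorithm produces
    # multiple spatially disjoint regions.
    return [(int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)]

def detect_with_attention(frames, detect):
    """Yield per-frame detections (x, y, s, c), running the detector on
    full key frames and only on changed regions in between."""
    key_frame, key_dets = None, []
    for t, frame in enumerate(frames):
        if t % KEY_FRAME_RATE == 0:
            key_frame, key_dets = frame, detect(frame)
            yield key_dets
            continue
        regions = changed_regions(frame, key_frame)
        partial = []
        for (x0, y0, x1, y1) in regions:
            for (x, y, s, c) in detect(frame[y0:y1, x0:x1]):
                partial.append((x + x0, y + y0, s, c))  # back to frame coords
        # Merge: keep key-frame detections outside every changed region
        # (a crude stand-in for the paper's duplicate arbitration).
        kept = [d for d in key_dets
                if not any(x0 <= d[0] < x1 and y0 <= d[1] < y1
                           for (x0, y0, x1, y1) in regions)]
        yield kept + partial
```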
1.2 Motion-based Tracking

In motion-based tracking, we exploit the spatio-temporal continuity of video to combine single-frame observations into face tracks, each of which is ultimately associated with a unique subject ID. For this evaluation, the tracking algorithm is substantially different from the algorithm applied in prior evaluations. First, the new algorithm is causal; previously, we tracked both forward and backward in time [1]. Second, we now use a globally optimal matching algorithm [4][5] for associating observations across frames, similar to that used by the CLEAR evaluators for matching ground-truth data to system output. Third, we have re-engineered the code for faster performance, primarily through the implementation of efficient, dynamic linked data structures. Below, we describe the revised tracking algorithm in greater detail.

Motion model: Let (z_t, c_t), z_t = [x_t, y_t, s_t]^T, denote the face location and size, and the classifier confidence, in frame t for a given person. Now assume that we have a collection of these observations for that person for t ∈ [0 … T], and, furthermore, that the person's motion is governed by a second-order motion model:

    ẑ_t = a_0 + a_1 t + a_2 t²    (1)
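The excerpt does not spell out the fitting or matching details, so the sketch below is our plausible reading, not the authors' method: confidence-weighted least squares fits the coefficients a_0, a_1, a_2 of Eq. (1) per track, and SciPy's Hungarian solver stands in for the globally optimal matching algorithm of [4][5]:

```python
# Sketch: fit the second-order motion model of Eq. (1) per track, then
# associate new detections with tracks by globally optimal matching.
# Assumptions (ours): confidence-weighted least squares and SciPy's
# Hungarian solver are stand-ins for unspecified details of the paper.
import numpy as np
from scipy.optimize import linear_sum_assignment

def fit_motion_model(times, z, conf):
    """Fit z_t ≈ a0 + a1*t + a2*t^2 to observations z (T x 3 array of
    [x, y, s]), weighting each frame by its classifier confidence."""
    times, z = np.asarray(times, float), np.asarray(z, float)
    A = np.vander(times, N=3, increasing=True)      # columns: 1, t, t^2
    w = np.sqrt(np.asarray(conf, float))[:, None]   # sqrt weights for LSQ
    coeffs, *_ = np.linalg.lstsq(A * w, z * w, rcond=None)
    return coeffs                                   # rows: a0, a1, a2

def predict(coeffs, t):
    """Evaluate ẑ_t = a0 + a1*t + a2*t^2 at frame t."""
    return np.array([1.0, t, t * t]) @ coeffs

def associate(predicted, detected, gate=50.0):
    """Globally optimal one-to-one matching between predicted track
    states and new detections, minimizing total Euclidean distance."""
    predicted = np.asarray(predicted, float)
    detected = np.asarray(detected, float)
    cost = np.linalg.norm(predicted[:, None, :] - detected[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    # Discard pairings too far apart to plausibly be the same face.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= gate]
```

This sketch gates matches on raw Euclidean distance in (x, y, s) space; a production tracker would likely also fold pose and confidence into the match cost.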